Documentation Fundamentals

Markdown, README, and Codebooks

Published

January 17, 2026

Why Documentation Matters

Good documentation is essential for reproducible research. Without it, even you won’t understand your own work six months later. Documentation serves three audiences: your future self, your collaborators, and the broader research community.

Key Definitions

Before diving in, let’s clarify terms that are often used interchangeably but mean different things:

| Term | What it is | Typical format |
| --- | --- | --- |
| README | Project overview and setup instructions | .md, .txt, .pdf |
| Codebook | Detailed variable-level documentation | .pdf, .xlsx |
| Data dictionary | Technical specification of variables (often synonymous with codebook) | .xlsx, .csv, .txt |
| Data lineage | The path data takes from source to final form | Diagram or narrative |
| Metadata | Data about data (when collected, by whom, how) | Various |

About Markdown

Markdown is a lightweight markup language that’s become the standard for documentation in data science. It’s readable as plain text but renders nicely in browsers and editors.

Tools for working with Markdown

  • Quarto — the successor to R Markdown, works with R, Python, Julia
  • Online Markdown editor — for quick testing
  • Pandoc — converts between formats (md → docx, pdf, html)
  • Dillinger — another online editor with live preview
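
As a rough illustration of the Pandoc conversion mentioned above, the sketch below wraps the command line `pandoc SOURCE -o TARGET` in Python. The file names and the helper names `pandoc_command` and `convert` are assumptions made for this example. Pandoc infers the input and output formats from the file extensions; PDF output additionally needs a PDF engine (a LaTeX installation by default).

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the Pandoc command line for converting source into target."""
    # Pandoc infers both formats from the file extensions (.md, .docx, .html, ...).
    return ["pandoc", source, "-o", target]


def convert(source, target):
    """Run Pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(pandoc_command("README.md", "README.docx")))
```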

Quick Markdown reference

# Heading 1
## Heading 2
**bold** and *italic*
- bullet point
1. numbered list
[link text](url)
`inline code`

What is a Good README?

A README is the front door to your project. Someone should be able to understand what your project does, how to use it, and where to find things—all from reading the README.

Key Ingredients

A complete README for a research project should include:

  1. Overview — What is this project? What question does it answer?
  2. Data sources — Where does the data come from? Any access restrictions?
  3. File structure — What’s in each folder? Which scripts run in what order?
  4. Requirements — Software, packages, and versions needed
  5. Instructions — How to run the analysis from start to finish
  6. License — Terms for reuse (MIT, CC-BY, etc.)
  7. Contact — Who to ask questions
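
One way to make sure none of the seven ingredients is forgotten is to generate an empty README skeleton and fill it in. The sketch below is a minimal example, not a standard template: the function name `readme_skeleton` and the `TODO` placeholders are assumptions made for illustration.

```python
# Section titles and prompts follow the seven ingredients listed above.
README_SECTIONS = [
    ("Overview", "What is this project? What question does it answer?"),
    ("Data sources", "Where does the data come from? Any access restrictions?"),
    ("File structure", "What is in each folder? Which scripts run in what order?"),
    ("Requirements", "Software, packages, and versions needed."),
    ("Instructions", "How to run the analysis from start to finish."),
    ("License", "Terms for reuse (MIT, CC-BY, etc.)."),
    ("Contact", "Who to ask questions."),
]


def readme_skeleton(project_title):
    """Return a Markdown README skeleton with one section per ingredient."""
    lines = [f"# {project_title}", ""]
    for title, prompt in README_SECTIONS:
        lines += [f"## {title}", "", f"TODO: {prompt}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(readme_skeleton("My research project"))
```

Saving the output as `README.md` gives a starting point in which every unanswered prompt is still visible as a `TODO`.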

Data Description Checklist

For each dataset in your project, document:

  • Name and file format (csv, parquet, xlsx)
  • Number of observations and variables
  • Unit of observation (person, firm-year, country-month)
  • Time coverage and geographic scope
  • Key variables with brief descriptions
  • Missing data: how much and why
  • Data lineage: source → processing → final structure
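
Some items on this checklist can be computed rather than typed by hand. The sketch below uses only the Python standard library to count observations and variables in a CSV file and to compute the share of missing values per column. The set of missing-value markers (empty string, `NA`, `-99`) is an assumption for this example and should be adapted to your data; the unit of observation, coverage, and the reasons for missingness still have to be written by a person.

```python
import csv
import io

# Assumption: these strings mark missing values in the file; adjust to your data.
MISSING_MARKERS = {"", "NA", "-99"}


def describe_csv(handle, missing_markers=MISSING_MARKERS):
    """Summarize a CSV: row count, column count, and share missing per column."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    n_rows = 0
    n_missing = {col: 0 for col in columns}
    for row in reader:
        n_rows += 1
        for col in columns:
            value = row[col]
            # DictReader fills short rows with None, which also counts as missing.
            if value is None or value.strip() in missing_markers:
                n_missing[col] += 1
    share_missing = {
        col: (n_missing[col] / n_rows if n_rows else 0.0) for col in columns
    }
    return {
        "n_observations": n_rows,
        "n_variables": len(columns),
        "share_missing": share_missing,
    }


if __name__ == "__main__":
    # Tiny in-memory file with made-up values, used only for demonstration.
    sample = io.StringIO("id,income_hh\n1,2500\n2,NA\n3,-99\n4,3100\n")
    print(describe_csv(sample))
```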

Examples of Good READMEs

Reproduction packages

Templates and guides

What is a Codebook (Variable Dictionary)?

A codebook provides detailed, variable-level documentation. While the README gives the big picture, the codebook tells you exactly what Q47_recoded means.

What to Include for Each Variable

| Element | Example |
| --- | --- |
| Variable name | income_hh |
| Label | Household monthly income |
| Type | Numeric (continuous) |
| Unit/metric | EUR, monthly |
| Valid range | 0–999999 |
| Coding for categories | 1=Low, 2=Medium, 3=High |
| Missing values | -99 = refused, NA = not asked |
| Share missing | 4.2% |
| Notes | Top-coded at 99th percentile |
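
Part of such an entry can be drafted from the data itself. The following sketch builds an entry for one numeric variable and renders it as a Markdown table. It is an illustration under stated assumptions: the helper names `codebook_entry` and `to_markdown_table` are made up, `-99` and `None` are treated as missing, and the range it reports is the observed range in the data, which is not the same thing as the documented valid range.

```python
def codebook_entry(name, label, unit, values, missing_codes=(-99, None)):
    """Build a codebook entry (a dict) for one numeric variable."""
    # Values equal to a missing code are excluded from the range calculation.
    valid = [v for v in values if v not in missing_codes]
    n_missing = len(values) - len(valid)
    return {
        "Variable name": name,
        "Label": label,
        "Unit/metric": unit,
        "Observed range": f"{min(valid)} to {max(valid)}" if valid else "n/a",
        "Share missing": f"{100 * n_missing / len(values):.1f}%" if values else "n/a",
    }


def to_markdown_table(entry):
    """Render one codebook entry as a two-column Markdown table."""
    rows = ["| Element | Value |", "| --- | --- |"]
    rows += [f"| {key} | {value} |" for key, value in entry.items()]
    return "\n".join(rows)


if __name__ == "__main__":
    # Made-up values for demonstration only.
    incomes = [2500, -99, 3100, None, 1800]
    entry = codebook_entry("income_hh", "Household monthly income", "EUR, monthly", incomes)
    print(to_markdown_table(entry))
```

Labels, units, and notes such as top-coding cannot be inferred from the values and still come from the person who knows the data.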

Examples of Good Codebooks

Tips for AI-Assisted Documentation

LLMs can significantly speed up documentation, but their output requires careful verification.

What AI does well

  • Summarizing long codebooks
  • Generating first drafts of variable descriptions
  • Suggesting what’s missing from your documentation
  • Converting between formats (e.g., codebook PDF → markdown table)

What requires human oversight

  • Verifying variable definitions match actual data
  • Checking that coded values (1, 2, 3…) match the stated meaning
  • Ensuring coverage statistics are accurate
  • Confirming data lineage is correct
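
Some of these checks can be supported by a small script that flags values contradicting the codebook, so that a person can review them. The sketch below is one possible approach, not a complete validation tool. The `CODEBOOK` dictionary is hypothetical: `income_hh` reuses the example range and missing code from the codebook section, and `income_cat` is an invented categorical variable.

```python
# Hypothetical codebook for illustration; replace with your own documentation.
CODEBOOK = {
    "income_hh": {"valid_range": (0, 999999), "missing_codes": {-99}},
    "income_cat": {"codes": {1: "Low", 2: "Medium", 3: "High"}, "missing_codes": {-99}},
}


def check_variable(name, values, codebook=CODEBOOK):
    """Return a list of human-readable problems found for one variable."""
    spec = codebook[name]
    problems = []
    for i, value in enumerate(values):
        if value is None or value in spec.get("missing_codes", set()):
            continue  # documented missing value, nothing to check
        if "codes" in spec and value not in spec["codes"]:
            problems.append(f"{name}[{i}]: undocumented code {value!r}")
        if "valid_range" in spec:
            low, high = spec["valid_range"]
            if not (low <= value <= high):
                problems.append(f"{name}[{i}]: {value!r} outside [{low}, {high}]")
    return problems


if __name__ == "__main__":
    # Made-up values for demonstration only.
    for problem in check_variable("income_cat", [1, 2, 3, 4, -99]):
        print(problem)
```

A script like this only shows that the data and the codebook disagree. Deciding which of the two is wrong remains a human judgment.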

A practical workflow

  1. Start with a first look to get a feel for the material: skim the documents, check the index, and open the data.
  2. Upload your codebook/data to the LLM
  3. Ask for a structured summary
  4. Verify 3-5 variables manually against the source
  5. Iterate: ask AI to fix errors you find
  6. Final human review before publishing
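
For the manual verification step, it can help to pick the variables at random, so that the check is not limited to the ones you already know well. The sketch below is a minimal example. The function name `pick_spot_checks`, the default of four variables, the seed, and most of the variable names are assumptions made for illustration.

```python
import random


def pick_spot_checks(variable_names, k=4, seed=2026):
    """Pick k variables to verify by hand; a fixed seed keeps the choice reproducible."""
    names = sorted(variable_names)  # sort first so input order does not matter
    rng = random.Random(seed)
    return rng.sample(names, min(k, len(names)))


if __name__ == "__main__":
    # Hypothetical variable names, apart from the two used earlier in the text.
    variables = ["age", "income_hh", "region", "educ", "hh_size", "Q47_recoded"]
    print(pick_spot_checks(variables))
```

Recording the seed alongside the results lets a collaborator see exactly which variables were checked.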

Further Reading